Wikipedia Arborification and 
Stratified Explicit Semantic Analysis 

Yannis Haralambous (1), Vitaly Klyuev (2)

(1) Institut Télécom, Télécom Bretagne & UMR CNRS 6285 Lab-STICC, Technopôle Brest Iroise, CS 83818,
29238 Brest Cedex 3, France
(2) University of Aizu, Aizu-Wakamatsu, Fukushima-ken 965-8580, Japan
yannis.haralambous@telecom-bretagne.eu, vkluev@u-aizu.ac.jp

Résumé

Nous présentons une extension du procédé d'analyse sémantique explicite de Gabrilovich et
Markovitch. À l'aide de leur mesure de parenté sémantique, nous pondérons le graphe des
catégories de Wikipédia. Puis, nous en extrayons un arbre couvrant minimal par le biais de
l'algorithme de Chu-Liu & Edmonds. Nous définissons une notion de tfidf stratifié, les strates
étant, pour une page Wikipédia et un terme donnés, le tfidf classique et les tfidfs catégoriels dans
les catégories ancêtres, au sens de l'arbre couvrant minimal. Notre méthode se sert de ce tfidf
stratifié, qui favorise les termes qui « survivent » lorsqu'on passe des pages aux catégories, en se
dirigeant vers la racine de l'arbre. Nous l'évaluons par une classification de textes tirés du corpus
WikiNews, et constatons qu'elle apporte un gain de précision de 18 %. Nous terminons par une
série de pistes de recherches futures.

ABSTRACT 

[This is the translation of paper Arborification de Wikipédia et analyse sémantique explicite
stratifiée submitted to TALN 2012.]

We present an extension of the Explicit Semantic Analysis method by Gabrilovich and Markovitch. 
Using their semantic relatedness measure, we weight the Wikipedia categories graph. Then, 
we extract a minimal spanning tree, using Chu-Liu-Edmonds' algorithm. We define a notion of 
stratified tfidf where the strata, for a given Wikipedia page and a given term, are the classical
tfidf and categorical tfidfs of the term in the ancestor categories of the page (ancestors in the
sense of the minimal spanning tree). Our method is based on this stratified tfidf, which adds extra
weight to terms that "survive" when climbing up the category tree. We evaluate our method by a
text classification on the WikiNews corpus: it increases precision by 18%. Finally, we provide
hints for future research.



1 Introduction 

1.1 Explicit Semantic Analysis 

Unlike semantic similarity measures, which are limited to ontological relations such as synonymy, 
hyponymy, meronymy, etc., semantic relatedness measures detect and quantify semantic relations 
of a more general kind. The typical example involves the concepts car, vehicle and gasoline.
A car is a special kind of vehicle, so we have a hypernymy relation, which can easily be
measured by a semantic similarity measure (for example, by taking the inverse of the length of
the shortest path between these concepts in WordNet). Between car and gasoline, however, there is
no semantic similarity, since a car is a solid object and fuel is a liquid. Yet there is an obvious
semantic relation between them, since most cars use gasoline as their energy source, and this
relation is what a semantic relatedness measure captures.

(Gabrilovich and Markovitch, 2007) introduce a semantic relatedness measure called ESA (Explicit
Semantic Analysis), as opposed to the popular method of latent semantic analysis (Deerwester
et al., 1989). ESA is based on the Wikipedia corpus. The principle is simple: after cleaning
and filtering Wikipedia pages (keeping only those with a sufficient amount of text and a given
number of incoming and outgoing links), they remove stop words, stem all words and calculate
their tfidfs. A Wikipedia page can then be represented as a vector in the space of (nonstop,
stemmed, distinct) words; the coordinates of the vector are the tfidf values (normalized so as to
have unit vectors and thus remain independent of the size of the page).

By the very nature of Wikipedia, we can consider that every page is a concept. All the concepts 
thus form a matrix whose columns are concepts and whose rows are words. By transposing this
matrix we obtain a representation of words in the space of concepts. The ESA measure of two 
words is simply the cosine of their vectors in this space. 

Intuitively, two words are ESA-semantically close if they appear frequently in the same Wikipedia 
pages (so that their tfs are high), and rarely in the whole corpus (for their dfs to be low). 
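
The construction above can be sketched in a few lines. This is only an illustration under stated assumptions: the four-page corpus and the names `word_vectors` and `esa` are hypothetical, the tfidf is the (1 + log tf) · log(#W / df) form used by ESA, and no filtering, stop-word removal or stemming is performed.

```python
import math
from collections import Counter

def word_vectors(pages):
    """pages: dict mapping a concept (page title) to its list of stemmed words.
    Returns a dict mapping each word to its vector {concept: normalized tfidf}."""
    n = len(pages)
    df = Counter(w for words in pages.values() for w in set(words))
    vectors = {}
    for concept, words in pages.items():
        raw = {w: (1 + math.log(f)) * math.log(n / df[w])
               for w, f in Counter(words).items()}
        norm = math.sqrt(sum(t * t for t in raw.values()))
        for w, t in raw.items():
            if t > 0:  # words present in every page have idf 0 and are skipped
                vectors.setdefault(w, {})[concept] = t / norm
    return vectors

def esa(u, v, vectors):
    """Cosine of the concept-space vectors of words u and v."""
    a, b = vectors.get(u, {}), vectors.get(v, {})
    dot = sum(x * b.get(c, 0.0) for c, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical four-page corpus, for illustration only.
pages = {
    "Car": ["car", "gasoline", "engine", "car", "wheel"],
    "Fuel": ["gasoline", "engine", "oil"],
    "Opera": ["opera", "music", "singer"],
    "Choir": ["music", "singer", "voice"],
}
vectors = word_vectors(pages)
print(esa("car", "gasoline", vectors), esa("car", "opera", vectors))
```

In this toy corpus "car" and "gasoline" share a page and obtain a positive relatedness, while "car" and "opera" never co-occur and obtain zero.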

Despite the good results obtained by this method, it has given rise to some criticisms. In
(Haralambous and Klyuev, 2011) we notice that ESA performs poorly when the relation
between words is mainly ontological. As an example, the word "mile" (length unit) does not
appear in the page of the word "kilometer" and the latter appears only once in the page of the
former: that is hardly sufficient to establish a nonzero semantic relatedness value; however, such
a relation is obvious, since both words refer to units of length measurement. An ontological
component, obtained from a WordNet-based measure, can fill this gap.



Another, more fundamental criticism is that of (Gottron et al., 2011), who argue that the choice
of Wikipedia is irrelevant, and that any corpus of comparable size would give the same results.
To prove it, they base ESA not on Wikipedia, but on a Reuters news corpus, and get results
even better than with standard ESA. According to the authors, the semantic relatedness value
depends only on the collocational frequency of the terms, whether or not documents correspond
to concepts. In other words, they deny the "concept hypothesis," which states that ESA
specifically uses the correspondence between concepts and Wikipedia pages.



In this article we will enhance ESA by adopting a different approach: the persistence of tfidfs of 
terms when leaving pages and entering the category graph. 



1.2 Wikipedia Categories 

Wikipedia pages are generally well written and structured, but unfortunately this is not always 
the case for page categories. This is due to two factors: first, the choice of categories is the result 
of a collective effort, and this is not always the ideal condition for obtaining a coherent result 
(not to mention the fact that writing a Wikipedia page requires good knowledge of a given topic 
while the choice of a category requires good knowledge of Wikipedia as a whole, which is less 
common among wikinauts). Second, there is no strict separation between thematic and
utilitarian categories. For example, categories are used both to classify pages according to their
themes and to identify pages that are too short, need corrections, or are likely to be deleted,
and so on.

Finally, what makes NLP applications more difficult is the fact that categories form a fairly
complex graph, which, in particular, contains cycles. According to (Medelyan et al., 2009),
"cycles are not encouraged but may be tolerated in rare cases." The very simple example of
categories "Zoologie" and "Animal" pointing to each other shows that the semantic relation
underlying subcategories is not always hypernymy: here, animal is the object of study of the
discipline zoologie. We attempted the following experiment: starting from the 2,782,242 (unfiltered)
French Wikipedia pages, we followed paths formed by the category links. The choice of each
subsequent category was made at random, but did not change during the experiment. Among
these paths, 2,162,115 resulted in cycles. It turned out that it was always the same 50 cycles, 12
of which were of length 3 (triangles) and all others of length 2 (categories pointing at each other,
as in the example above, which was detected by this method). Our experiment shows
that the problem is real but not insurmountable, since the number of cycles can probably be kept
within reasonable limits.
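
The experiment can be mimicked on a toy graph as follows. This is a minimal sketch, not the code used for the experiment: the function name `find_cycles`, the seed and the six-node graph are assumptions made for illustration. Each node receives one randomly chosen parent that stays fixed, and every page is followed upwards until it either stops or revisits a node.

```python
import random

def find_cycles(parents, pages, seed=0):
    """parents: dict node -> list of parent categories (empty for the root).
    Returns (number of pages whose path ends in a cycle, set of distinct cycles)."""
    rng = random.Random(seed)
    # One random parent per node, fixed for the whole experiment.
    choice = {n: rng.choice(ps) for n, ps in parents.items() if ps}
    cycles, trapped = set(), 0
    for page in pages:
        path, seen, node = [], {}, page
        while node in choice and node not in seen:
            seen[node] = len(path)
            path.append(node)
            node = choice[node]
        if node in seen:  # we came back to a node already on the path
            cyc = path[seen[node]:]
            k = cyc.index(min(cyc))
            cycles.add(tuple(cyc[k:] + cyc[:k]))  # canonical rotation
            trapped += 1
    return trapped, cycles

# Hypothetical toy graph containing the Zoologie/Animal 2-cycle.
parents = {
    "p1": ["Zoologie"], "p2": ["Musique"],
    "Zoologie": ["Animal"], "Animal": ["Zoologie"],
    "Musique": ["Article"], "Article": [],
}
trapped, cycles = find_cycles(parents, ["p1", "p2"])
print(trapped, cycles)
```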



1.3 Related Work 



(Scholl et al., 2010) enhance the performance of ESA using categories. They proceed as follows:
let $T$ be the matrix whose rows represent the Wikipedia pages and whose columns represent
words. The value $t_{ij}$ of cell $(i, j)$ is the normalized tfidf of the $j$th word in the $i$th page. For each
word $m$ there is therefore a vector $v_m$ whose dimension is equal to the number of pages. Now let
$C$ be the matrix whose rows are pages and whose columns are categories. The value of a cell $c_{ij}$
is 1 when page $i$ belongs to category $j$ and 0 otherwise. Then they take the product of matrices
$v_m \cdot C$, which provides a vector whose $j$th component is $\sum_{p_i \in c_j} t_{im}$ (where $p_i$ is the
$i$th page), that is, the sum of tfidfs of word $m$ for all pages belonging to the $j$th category. They use
the concatenation of vector $v_m$ and of the transpose of $v_m \cdot C$ to improve system performance on
the text classification task. They call this method XESA (extended ESA).
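
As an illustration of the XESA construction just described, the sketch below computes the category sums for one word and concatenates them with the page vector. The function name `xesa_vector`, the list-based encoding of page memberships and the numbers are assumptions; the sparse membership lists stand in for the 0/1 matrix $C$.

```python
def xesa_vector(v_m, membership, n_categories):
    """v_m: list of tfidfs of word m, one entry per page.
    membership[i]: indices of the categories that page i belongs to.
    Returns v_m concatenated with the per-category sums of tfidfs."""
    category_sums = [0.0] * n_categories
    for i, t in enumerate(v_m):
        for j in membership[i]:
            category_sums[j] += t  # j-th component: sum over pages of category j
    return list(v_m) + category_sums

# Hypothetical example: three pages, two categories.
v_m = [0.5, 0.0, 0.25]
membership = [[0], [0, 1], [1]]
extended = xesa_vector(v_m, membership, 2)
print(extended)
```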

We see that Scholl et al. extend page tfidf to categories by simply taking the sum of tfidfs of all
pages belonging to a given category. This approach has a disadvantage when it comes to high-level
categories: instead of being a way to find the words that characterize a given category, the
tfidf of a word tends to become nothing more than the average density of the word in the corpus,
since for large categories, tf tends to the total number of occurrences of the word in the corpus,
while the denominator remains constant and equal to the number of documents containing the
given word. Thus, this type of tfidf loses its power of discrimination for high-level categories. We
propose another extension of tfidf to categories, which we call categorical tfidf. The difference
lies in the denominator, where we take the number not of all documents containing the term,
but only of those not belonging to the category. Thus our categorical tfidf (which is equal to
the usual tfidf in the case of pages) is high when the term is common in the category and rare
elsewhere (as opposed to rare in the entire corpus, as in Scholl et al.).



On the other hand, in (Collin et al., 2010), the authors pose the problem of inconsistency of
Wikipedia's category graph and propose a shortest path approach (based on the number of edges)
between a page and the category "Article," which is at the top of the hierarchy. The shortest path
provides them with a semantic and thematic hierarchy.



However, as already observed in the case of WordNet (Leacock and Chodorow, 1998, p. 275), the
length (in number of edges) of the shortest path can vary randomly, depending on the density of
pages (concepts, in the case of WordNet) in a given domain of knowledge. Moreover,
the distance (in number of edges) between a leaf and the top of the hierarchy is often quite short,
which frequently requires an arbitrary choice between paths of equal length.

What is common with our approach is the intention to simplify Wikipedia's category graph. But 
instead of taking the number of edges, we weight the graph using the ESA measure and use that
weight for our simplification. This weight, which is based on the statistical presence of words 
in pages belonging to a given category, allows us to calculate a minimum spanning tree. The 
result of this operation is that any page (or category, other than "Article") has exactly one parent 
category that is semantically closest to it. This happens in a global sense, that is to say, so that 
the total weight of the tree is minimum. 

We use this tree to define a notion of stratified tfidf. Our goal is to avoid words which, by chance, 
have a high tfidf in a given page despite the fact that they do not really belong to the theme 
of the page. Our hypothesis is that a word having an unduly high tfidf will disappear when we 
calculate its tfidf in higher categories. Only the words in line with the theme of the page "survive" 
when we move away from leaves of the tree and towards the root. 



In this way the "concept hypothesis," denied by (Gottron et al., 2011), is again essential: when
there is no concept, there is no hierarchy, and when there is no hierarchy, survival of words in
categories would be random and not due to the inherent hierarchy of the path.



2 Adaptation of ESA to French Wikipedia 



To adapt ESA to French Wikipedia, we followed the same steps as (Gabrilovich and Markovitch,
2007) and (Çallı, 2010) except for one thing: we have preceded the stemming step by lemmatization,
to avoid loss of information due to poor stemming of inflected words (in English this
phenomenon is negligible, so that stemming can be performed directly).



By limiting the minimum size of pages to 125 (nonstop, stemmed and distinct) words, 15
incoming and 15 outgoing links, we obtained a number of Wikipedia pages equivalent to the
original ESA method: 128,701 pages (out of 2,782,242 in total) containing 1,446,559 distinct
words (only 339,679 of which appear more than three times in the corpus).

(Gabrilovich and Markovitch, 2007) evaluate their method on WS-353, a set of 352 pairs of
English words, the semantic relatedness of which has been evaluated by 15-16 human judges.
Their criterion is the Spearman correlation coefficient between the ranking of pairs obtained by
ESA and the one obtained by taking the average of human judgments. We have translated
these pairs into French, but the result was very disappointing. Indeed, some twenty words cannot
be translated into a single term (the current version of ESA covers only single-word terms), such
as "seafood," which can be translated only as "fruits de mer." Furthermore, there are ambiguities of
translation resulting from word polysemy. When we translate the pair "flight/car" by "vol/voiture,"
we obtain a high semantic relatedness due to the criminal sense of "vol" (= theft), while the
sense of the English word "flight" is mainly confined to the domain of aviation. Finally, some
obvious collocations disappear when translating word for word, such as "soap/opera," which is
unfortunately not comparable to "savon/opéra."

We have therefore chosen to evaluate our implementation of ESA in a more traditional way, by 
performing a text classification task. From the French WikiNews corpus, we have extracted four 
classes of documents corresponding to the following topics: politics, economics, culture, sports. 
Here are the characteristics of our test corpus: 



Class      Number of texts   Number of words   Number of n.s.d. words
Politics               466            70,829                    5,537
Economy                710            97,610                    6,951
Culture                327            46,812                    6,063
Sports               1,671           166,174                    7,792
Total                3,174           381,425                   16,323

where the 4th column contains the number of nonstop, stemmed, distinct (n.s.d.) words.

We represent each document as follows: let $w$ be a word in document $d$, $\mathbf{w}$ the vector of the word
in the space of concepts, and $t_d(w)$ its tfidf¹ in $d$. Then the vector $\mathbf{d}$ of document $d$ is defined as:

$$\mathbf{d} := \frac{\sum_{w \in d} t_d(w)\,\mathbf{w}}{\left\lVert \sum_{w \in d} t_d(w)\,\mathbf{w} \right\rVert}$$

where the denominator is used for normalization. In this way, all $\mathbf{d}$ are of (Euclidean) norm 1,
and thus belong to the sphere $\mathbb{S}^N$, where $N$ is the number of Wikipedia pages (in our case:
$N = 128{,}701$).
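
A document vector of this kind can be computed as in the sketch below, which assumes that the normalization divides the weighted sum of word vectors by its own Euclidean norm. The function name `document_vector`, the sparse dictionary encoding and the sample values are hypothetical.

```python
import math

def document_vector(tfidf, word_vectors):
    """tfidf: dict word -> t_d(w) for one document.
    word_vectors: dict word -> {concept: coordinate} (concept-space vectors).
    Returns the unit-norm weighted sum of the word vectors."""
    acc = {}
    for w, t in tfidf.items():
        for concept, x in word_vectors.get(w, {}).items():
            acc[concept] = acc.get(concept, 0.0) + t * x
    norm = math.sqrt(sum(x * x for x in acc.values()))
    return {c: x / norm for c, x in acc.items()} if norm else {}

# Hypothetical word vectors over two concepts.
wv = {"car": {"Car": 1.0}, "gasoline": {"Car": 0.6, "Fuel": 0.8}}
d = document_vector({"car": 2.0, "gasoline": 1.0}, wv)
print(d)
```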

We applied a linear SVM classifier to the set of these vectors and the corresponding classes, and 
after a tenfold cross-validation, we obtained the following result: 



Dim. of subspace   Number of SVs   Precision
         103,291             665      70.08%



where the dimension of the subspace of concepts is the size of the smallest subspace that
contains the vectors of all documents in our corpus. The precision obtained is rather low, which
is probably, at least in part, due to the small size of texts in the corpus. That said, our goal
is not to compare ESA to other classification methods, but to show that our approach improves
ESA. So, this result is our starting point and we intend to improve it.

1. (Gabrilovich and Markovitch, 2007) define tfidf as $t_d(w) = (1 + \log(f_d(w))) \cdot \log\left(\frac{\#W}{\mathrm{df}(w)}\right)$, where $f_d(w)$ is the
frequency of $w$ in page $d$, $\#W$ is the total number of Wikipedia pages and $\mathrm{df}(w)$ is the number of pages containing $w$.

Figure 1: Ingoing and outgoing degree distribution of Wikipedia categories (two panels: log-distribution
of ingoing degrees against log(ingoing degree), and log-distribution of outgoing degrees against
log(outgoing degree)).



3 Wikipedia Arborification 



Wikipedia's category graph has been studied thoroughly in (Zesch and Gurevych, 2007) for the
English version of Wikipedia. French Wikipedia (version of December 31, 2011) has 2,782,242
pages, 293,244 categories, 680,912 edges between categories and 12,935,688 edges between
pages and categories (note that the same category may have incoming edges from both pages
and other categories). As can be seen in Fig. 1 from the logarithmic distribution of incoming
and outgoing degrees, this graph follows a power law $p^{-\alpha}$ with $\alpha = 2.08$ for
incoming degrees (a pretty standard value) and $\alpha = 7.51$ for outgoing degrees (a
rather high value).

These calculations were made on the entire Wikipedia category graph. As explained in the 
previous section, to calculate ESA we filtered the corpus and selected only the most important 
pages. This also decreased the number of categories. Furthermore we removed the auxiliary 
categories (but not "portals" since they carry semantics). In the end, 104,258 categories remained
for 128,701 pages, with 1,605,946 edges, 387,146 of which were edges between categories.

We first calculated the pages that belong to every category (or to its subcategories), i.e., for a
given category $c$, all pages $d$ such that there is a membership $d \to c_1$ and a sequence of inclusions
$c_1 \to c_2 \to \cdots \to c$. Let $\mathcal{F}(c)$ be the set of these pages (= the set of "leaves of $c$"), $W$ the entire
(filtered) set of Wikipedia pages and $\#W$ its cardinal. If $w \in d$ is a (nonstop, stemmed, distinct)
word of $d \in W$ and $t_d(w)$ its classical tfidf, then we define the categorical tfidf $t_c(w)$ of $w$ for
category $c$ as follows:

$$t_c(w) := \left(1 + \log \sum_{d \in \mathcal{F}(c)} f_d(w)\right) \cdot \log\left(\frac{\#W}{1 + \sum_{d \in W \setminus \mathcal{F}(c)} \mathbb{1}_{w \in d}}\right)$$

The difference with the classical tfidf is in the calculation of idf: instead of the $\sum_{d \in W} \mathbb{1}_{w \in d}$ used
by (Scholl et al., 2010), we focus on the set difference between Wikipedia and the leaf pages of $c$,
and we use $1 + \sum_{d \in W \setminus \mathcal{F}(c)} \mathbb{1}_{w \in d}$ instead (the unit is added to avoid a zero df when the term does
not appear outside the category). We believe that this extension of tfidf to categories improves
discriminatory potential, even when the sets of leaves are large.
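
The following sketch shows one way to compute such a categorical tfidf. It assumes the form (1 + log of the summed frequencies over the leaves) times log(#W / (1 + number of pages outside the leaves that contain the word)); the function name `categorical_tfidf`, the dictionary encoding and the toy frequencies are hypothetical.

```python
import math

def categorical_tfidf(word, leaves, freq):
    """freq: dict page -> {word: number of occurrences} for the whole corpus.
    leaves: set of pages belonging to the category or its subcategories."""
    tf = sum(freq[d].get(word, 0) for d in leaves)
    if tf == 0:
        return 0.0
    # Document frequency counted only outside the category, plus one.
    df_outside = sum(1 for d in freq if d not in leaves and word in freq[d])
    return (1 + math.log(tf)) * math.log(len(freq) / (1 + df_outside))

# Hypothetical corpus of four pages.
freq = {
    "a": {"fugue": 3},
    "b": {"fugue": 1, "goal": 2},
    "c": {"goal": 5},
    "d": {"goal": 1},
}
in_category = categorical_tfidf("fugue", {"a", "b"}, freq)
spread_out = categorical_tfidf("goal", {"a", "b"}, freq)
print(in_category, spread_out)
```

With a single-page "category" the value coincides with the classical tfidf of that page, since the page itself accounts for the added unit in the document frequency.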

Using this tool, we calculated the vectors of categories, defined as follows: 



c := 



Note, however, that for practical reasons, we limited the number of nonzero coordinates of 
each vector to 1,000, taking the tfidf of the thousand most frequent words in each category. 
This compromise has been useful since the number of distinct words increases gradually as we 
move towards the top of the hierarchy, and since every page is owned by category "Article", this 
also stands for every word and thus the vector of that category would have 339,679 nontrivial 
dimensions, which would make calculations too heavy for our equipment. 

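Having obtained the vectors of all categories, we define the weight of semantic relatedness of any
inclusion of categories $c_i \to c_j$ by

$$p(c_i \to c_j) = \langle \mathbf{c}_i, \mathbf{c}_j \rangle,$$

where $\langle \cdot, \cdot \rangle$ is the Euclidean scalar product of two vectors. Since all vectors are unitary
(and their coordinates, being tfidf values, are nonnegative), we have $\mathrm{Im}(p) \subset [0, 1]$. Similarly,
for membership of pages to categories, we define:

$$p(d_i \to c_j) = \langle \mathbf{d}_i, \mathbf{c}_j \rangle.$$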



Let $W_p$ be the weighted Wikipedia digraph (whose vertices are pages and categories, whose edges are
memberships of pages and inclusions of categories, and whose weight is defined above). Various
algorithms exist to obtain minimum weight spanning trees, the most famous being those of
Kruskal and Prim. Unfortunately, these algorithms apply only to undirected graphs.

As for digraphs, there is not always a spanning tree respecting the orientation of the graph. For
example, the graph

[Figure: a small digraph on four vertices]

has no path of length > 1, let alone a spanning tree.

However, the situation is different for the special case of graphs with a specific property: the 
existence of a vertex from which all others can be reached by directed paths. In this case, there is 
always a (directed) spanning tree. 



This is easily demonstrated through the usual method of progressive cycle removal. Indeed, let there be
a cycle abcdef. By our hypothesis, there is a vertex s from which one can reach all vertices, so we are
necessarily in one of the following three cases:

[Figure: the three cases, showing the cycle abcdef and the paths reaching it from s]

where dashed arrows represent directed paths. In the first case, all vertices of the cycle are reachable from
outside the cycle. Then we can remove all edges of the cycle, since all of its vertices are already reached and part
of the tree. In the second case, part of the cycle is essential to reach vertices, for example, b is reached only
when going through a. In this case we simply remove the edges of the cycle which are last before vertices
reached from outside the cycle (in the example, this concerns the edges bc and fa). Finally, in the third
case, a single vertex is reached by an outside path. In this case we remove the last edge before the vertex,
in our case: ab. Doing this we can "break" all cycles while keeping the (weak) connectivity of the graph,
and thus obtain a spanning tree. This proof of existence is not an efficient method for actually finding the
spanning tree, since it provides us with no way of detecting cycles.
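
The existence statement can also be checked constructively with a breadth-first search from the candidate root. This is a different technique from the cycle-removal argument above, and the tree it returns is not weight-minimal. The function name `spanning_tree_from` and both toy graphs are assumptions for illustration.

```python
from collections import deque

def spanning_tree_from(root, successors):
    """successors: dict vertex -> list of vertices reached by one directed edge.
    Returns {child: parent} for a directed spanning tree rooted at root,
    or None when some vertex cannot be reached from root."""
    parent, visited, queue = {}, {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in successors.get(u, ()):
            if v not in visited:  # edges closing a cycle are simply skipped
                visited.add(v)
                parent[v] = u
                queue.append(v)
    vertices = set(successors) | {v for vs in successors.values() for v in vs}
    return parent if visited == vertices else None

# A cycle a -> b -> c -> a entered from s: a spanning tree exists.
tree = spanning_tree_from("s", {"s": ["a"], "a": ["b"], "b": ["c"], "c": ["a"]})
# No vertex reaches all others here: no directed spanning tree from x.
no_tree = spanning_tree_from("x", {"x": ["y"], "z": ["y"]})
print(tree, no_tree)
```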



One way to obtain such a spanning tree is by Chu-Liu & Edmonds' algorithm (Gabow et al., 1986,
pp. 113-119), published for the first time in 1965. This semi-linear algorithm returns a minimum
weight forest of trees covering the digraph. The orientation of these trees (which is compatible 
with that of the graph) follows a very strict principle: each tree has a vertex (which we call 
"root") from which all other vertices are reachable by directed paths. 

In the case of a graph that already has a vertex with such a property, the forest becomes a single 
tree, and we get a true directed minimal spanning tree of the graph. 
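
For readers who want to experiment, here is a compact recursive sketch of the Chu-Liu/Edmonds contraction scheme. It is a textbook-style version, not the efficient variant of Gabow et al. and not the implementation used in this work; the function name `min_arborescence`, the edge encoding and the three-node example are hypothetical, and every node is assumed to be reachable from the root.

```python
def min_arborescence(nodes, edges, root):
    """edges: dict (u, v) -> weight of the directed edge u -> v.
    Returns {child: parent} for a minimum-weight spanning arborescence."""
    # 1. Cheapest incoming edge of every non-root node.
    best = {}
    for (u, v), w in edges.items():
        if v != root and u != v and (v not in best or w < edges[(best[v], v)]):
            best[v] = u
    # 2. Look for a cycle among the selected edges.
    cycle = None
    for start in nodes:
        path, seen, x = [], set(), start
        while x in best and x not in seen:
            seen.add(x)
            path.append(x)
            x = best[x]
        if x in seen:
            cycle = path[path.index(x):]
            break
    if cycle is None:
        return best
    # 3. Contract the cycle into one node; entering edges get reduced weights.
    in_cycle, c = set(cycle), ("cycle", tuple(cycle))
    new_edges, origin = {}, {}
    for (u, v), w in edges.items():
        if u in in_cycle and v in in_cycle:
            continue
        if v in in_cycle:
            key, nw = (u, c), w - edges[(best[v], v)]
        elif u in in_cycle:
            key, nw = (c, v), w
        else:
            key, nw = (u, v), w
        if key not in new_edges or nw < new_edges[key]:
            new_edges[key], origin[key] = nw, (u, v)
    new_nodes = [x for x in nodes if x not in in_cycle] + [c]
    contracted = min_arborescence(new_nodes, new_edges, root)
    # 4. Expand: map contracted edges back and keep the rest of the cycle.
    parent = {}
    for v, u in contracted.items():
        ou, ov = origin[(u, v)]
        parent[ov] = ou
    for v in cycle:
        parent.setdefault(v, best[v])
    return parent

# Hypothetical example: the greedy choice creates the cycle A <-> B.
edges = {("Article", "A"): 5, ("Article", "B"): 6, ("A", "B"): 1, ("B", "A"): 1}
tree = min_arborescence(["Article", "A", "B"], edges, "Article")
print(tree, sum(edges[(p, ch)] for ch, p in tree.items()))
```

In the example, each of A and B first selects the cheap edge coming from the other, which forms a cycle; the contraction step then decides where the cycle is entered from the root.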

If we bothered to delve into such graph theory depths, it is because the Wikipedia category graph 
is indeed of this kind: when we reverse the membership/inclusion relations, then every page is 
reachable from the root, which is category "Article." 

Here is an example of the results of this arborification². We took the page "Motif BACH"³ and
calculated the lattice which separates it from the category "Article." This lattice contains 532
categories and 1,170 category inclusions (plus the 4 edges representing the membership of
"Motif BACH" to the four categories "Johann Sebastian Bach," "Portail Musique classique/Articles
liés," "Portail Musique/Articles liés," "Théorie de la musique"). In Fig. 2 the reader can see a
representation of this subgraph of the Wikipedia graph. The edges belonging to the arborification
of Wikipedia are drawn in thick light blue lines, those belonging to the graph but not to the
tree in thin brown lines, and finally the path connecting "Motif BACH" to "Article" in dark green.
Note that the algorithm has chosen "Théorie de la musique" rather than the category carrying
the composer's name as the semantically most related category. This can be explained by the
fact that the page "Motif BACH" focuses on works by other composers rather than on those by
Bach, despite the obvious trap that the page title and the category "Johann Sebastian Bach" both
contain the composer's name.

Figure 2: Graph of categories located between page "Motif BACH" and the top of the hierarchy
(category "Article").

2. The term "arborification," meaning "transformation into a tree," while not frequent, is nevertheless used in some
knowledge domains: pure mathematics (arborification of multiple divergent series), Latin literature (arborification of
Baucis into a lime tree and Philemon into an oak, in Ovid, Metamorphoses VII, 714-719) and (!!) the interpretation
of dreams. Ironically, the word "arborification" does not appear in French Wikipedia, whose search engine suggests
"aridification" instead (indeed, the Levenshtein distance between the two terms is only 3 units).

3. This is the musical theme B flat, A, C, B natural, which Bach inserted in the Art of the Fugue. Other composers
have used the same theme to honor Bach and so it has become a leitmotiv of the last four centuries.

Remember, however, that our goal is not to find the most appropriate taxonomy, i.e., the most
relevant path from a page to the top. To get it, it would be wiser to use a shortest path algorithm,
such as the one by Dijkstra. This has already been proposed in (Collin et al., 2010), but for the
metric of the number of edges; in our case we would rather use our weighting of the graph.
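
For completeness, a weighted shortest path of this kind could be obtained with a standard Dijkstra search, as sketched below. The node names and edge costs are invented for illustration, and how a relatedness weight should be turned into a nonnegative cost is left open here.

```python
import heapq

def shortest_path(graph, source, target):
    """graph: dict node -> {neighbor: nonnegative edge cost}.
    Returns the list of nodes on a cheapest path, or None if unreachable."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical costs between a page, three categories and the root.
graph = {
    "page": {"cat A": 0.9, "cat B": 0.2},
    "cat A": {"Article": 0.1},
    "cat B": {"cat C": 0.3},
    "cat C": {"Article": 0.3},
}
path = shortest_path(graph, "page", "Article")
print(path)
```

With edge counts alone the route through "cat A" would win (two edges against three); with the weights it does not.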



4 Stratified Tfidf 



We will use the page ancestors in the minimal spanning tree to update tfidf values of words in
the page vectors. Recall the definition of a Wikipedia page vector, given in Section 2:

$$\mathbf{d} := \frac{\sum_{w \in d} t_d(w)\,\mathbf{w}}{\left\lVert \sum_{w \in d} t_d(w)\,\mathbf{w} \right\rVert}$$

where $t_d(w)$ is the tfidf of word $w$ for document $d$. We will change function $t_d$, so that it takes
the ancestor categories of page $d$ into account.

We were motivated by the following problem: sometimes one finds words with relatively high
tfidfs in pages belonging to different contexts (e.g., in a slightly caricatural case: the word with
the highest tfidf on the page "Roman noir" is "mai-juin" (= May-June), because in the bibliography a
book published in May-June 2001 is cited three times). Our method aims to reduce the influence of
out-of-context words, the presence of which can decrease the relevance of concept vectors.

For this, we chose to give more importance to the tfidf of words being in the "right context." It 
is reasonable to assume that the "right context" is one that is consistent with the categories to 
which the page belongs, and, in particular, with the main category and other ancestors obtained 
by taking the tree path that leads to the root. But according to our definition, words have tfidf 
values in categories. So we just use these tfidf values to boost words that "survive" when moving 
to categories, that is to say, words that appear in both the page vector and in category vectors. 

To do this, we replace $t_d$ by $t'_d$, defined as follows:

$$t'_d(w) = t_d(w) + \sum_{i > 0} \lambda_i\, t_{\pi^i(d)}(w)$$

where $\pi$ is the operation replacing a node of the tree by its parent, and $\pi^i$ is its $i$th iteration (for
practical reasons, we restricted ourselves to $i \le 3$). The coefficients $\lambda_i$ should form a decreasing
sequence, and we tested the following three cases: $\lambda_i = 2^{-i}$, $\lambda_i = \frac{1}{10 \cdot 2^{i-1}}$, $\lambda_i = 1$.

We call this new tfidf stratified because we move upwards through the different strata to provide 
a semantic consolidation of the page stratum, which is the lowest. 
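
A stratified tfidf with three strata above the page and coefficients 1/2, 1/4, 1/8 can be sketched as follows. The function name `stratified_tfidf`, the category names between "Roman noir" and "Article" and all tfidf values are made up for the example; only the shape of the computation follows the definition above.

```python
def stratified_tfidf(word, page, parent, tfidf, lambdas=(0.5, 0.25, 0.125)):
    """parent: dict node -> its parent in the spanning tree.
    tfidf: dict node -> {word: tfidf (classical for pages, categorical above)}."""
    total = tfidf.get(page, {}).get(word, 0.0)
    node = page
    for lam in lambdas:
        node = parent.get(node)
        if node is None:  # the root has been passed
            break
        total += lam * tfidf.get(node, {}).get(word, 0.0)
    return total

# Hypothetical ancestors and values for the page "Roman noir".
parent = {"Roman noir": "Genre", "Genre": "Litterature", "Litterature": "Article"}
tfidf = {
    "Roman noir": {"mai-juin": 4.0, "polar": 3.0},
    "Genre": {"polar": 2.0},
    "Litterature": {"polar": 1.0},
    "Article": {},
}
in_context = stratified_tfidf("polar", "Roman noir", parent, tfidf)
out_of_context = stratified_tfidf("mai-juin", "Roman noir", parent, tfidf)
print(in_context, out_of_context)
```

In this toy setting the out-of-context word keeps its page value only, while the word that survives in the ancestor categories overtakes it.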

We followed the same modus operandi as in Section 2 and obtained the following results:

Strata λ_i          Dim. of concept subspace   Number of SVs   Precision
1/2, 1/4, 1/8                        121,114           1,280      88.62%
1/10, 1/20, 1/40                     119,918           1,298      88.50%
1, 1, 1                              119,593           1,347      87.90%
(without strata)                     103,291             665      70.08%




which are significantly better than those of conventional ESA on the same corpus (accuracy of
70.08%). This confirms our approach. Note that between standard and stratified ESA, the
increased accuracy is consistent with the increase in the size of the subspace of concepts and in
the number of support vectors (which practically doubled while the dimension of the subspace
of concept vectors increased by less than 20%). On the other hand, within the group of our three
tests, we maintain the correlation between accuracy and dimension of the concept subspace but
the correlation between precision and number of support vectors has been reversed. We also
note that the variations between the first two types of strata are small, and that the caricatural
choice of constant $\lambda_i = 1$ (which means that we give equal importance to the categories and to
the page itself) has degraded performance, but only slightly. Finally, a last observation: needing
121,114 concepts (= pages) means that we use 94% of Wikipedia pages, while the nonstratified
method uses only 80% of them, which shows that our method was able to benefit more from the
semantic richness of the corpus, mobilizing a larger number of pages.
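
The ratios quoted in this discussion follow directly from the figures reported in the results (best stratified run against standard ESA, with $N = 128{,}701$ pages); the rounding is ours:

```latex
\begin{align*}
\frac{121{,}114}{128{,}701} &\approx 0.941, &
\frac{103{,}291}{128{,}701} &\approx 0.803, \\
\frac{121{,}114}{103{,}291} &\approx 1.173, &
\frac{1{,}280}{665} &\approx 1.925, \\
88.62 - 70.08 &= 18.54 \text{ percentage points.}
\end{align*}
```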



5 Conclusion and Hints for Further Research 

A study of the Wikipedia category graph has revealed that one can extract a directed spanning 
tree, whose root is the category "Article," the hierarchical summit of the corpus. By applying 
Chu-Liu & Edmonds' algorithm, we obtained a minimum spanning tree — minimum with respect 
to the metric induced from explicit semantic analysis and adapted to categories by the notion of 
categorical tfidf. In this tree, every page and every category has a unique list of ancestors. We
use this list to define a stratified tfidf, whose strata are the values of the tfidfs a given term
takes in the various ancestor categories.

We evaluated our method by a text classification task: stratified tfidf allowed us to improve the 
classification accuracy by 18%. 

Here are some hints for research to be done: 

1. reduce the size of arborified Wikipedia by removing the intermediate, nondiscriminative 
categories; 

2. quantify the relevance of the hierarchy obtained (both by frequentist methods and by 
comparing, for example, with the taxonomy of WordNet concepts), find its weak areas, 
compare with versions of Wikipedia in other languages; 

3. examine the impact of the restriction on the maximum size of vectors for high-level
categories (in this article we placed an artificial limit of 1,000 nonzero values; as the
leaves of a category become more numerous, this limit can prove to be binding);

4. aggregate our stratified measure with collocational and ontological components, as in
(Haralambous and Klyuev, 2011);

5. propose the notion of the "most relevant category" to Wikipedia users and use their answers 
to improve the system; 

6. propose an alternative measure, based on links between pages (or categories), proportional 
to the number of links (or link paths) between pages and inversely proportional to the 
length of these paths. Properly define this measure and compare it to ESA (which uses the 
number of links between pages to filter Wikipedia, but does not include it into semantic 
relatedness calculations); 

7. in this article as well as in the hints above, the amount of data processed is becoming 
increasingly monstrous (especially when applied to English Wikipedia) . Find the best 
dimensionality reduction methods to make them implementable; 

8. and, more generally, explore the applications of graph theory to the formidable 
mathematical-linguistic objects that are the different graphs extracted from Wikipedia. 